Recognition and Tagging of Compound Verb Groups in Czech
Abstract
In Czech corpora, compound verb groups are usually tagged word by word. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. We present a method for automatic recognition of compound verb groups in Czech. From an annotated corpus, 126 definite clause grammar rules were constructed. These rules describe all compound verb groups that are frequent in Czech. Using these rules, we can find compound verb groups in unannotated texts with an accuracy of 93%. We also describe how the verb rules are used to tag compound verb groups in an annotated corpus.

Keywords: compound verb groups, chunking, morphosyntactic tagging, inductive logic programming

1 Compound Verb Groups

Recognition and analysis of the predicate in a sentence is fundamental to determining the meaning of the sentence and to its further analysis. In more than half of Czech sentences the predicate contains a compound verb group. E.g. in the sentence

Mrzí mě, že jsem o té konferenci nevěděla, byla bych se jí zúčastnila. (literal translation: I am sorry that I did not know about the conference, I would have participated in it.)

there are three verb groups:

<vg>Mrzí</vg> mě, že <vg>jsem o té konferenci nevěděla</vg>, <vg>byla bych se jí zúčastnila.</vg>
<vg>I am sorry</vg> that <vg>I did not know</vg> about the conference, <vg>I would have participated</vg> in it.

Verb groups are often split into several parts by so-called gap words. In the second verb group the gap words are o té konferenci (about the conference). In annotated Czech corpora, including DESAM (Pala et al., 1997), compound verb groups are usually tagged word by word. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. That is, the tags are correct for a single word, but they do not reflect the meaning of the words in context.
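As a rough illustration of gap words and group boundaries, the following Python sketch records, for the example sentence, where each verb group begins and ends and which words inside that span are gap words. The group ids are assigned by hand, and the names Token, SENTENCE and summarize_groups are hypothetical. This is not the recognition method of the paper, which relies on definite clause grammar rules; it only shows what a group-level description could hold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    form: str
    group: Optional[int]  # hand-assigned verb group id; None = not a component


# Example sentence; group membership is an assumption made for illustration
# (verbs and the reflexive pronoun "se" count as components).
SENTENCE = [
    Token("Mrzí", 1), Token("mě", None), Token(",", None), Token("že", None),
    Token("jsem", 2), Token("o", None), Token("té", None),
    Token("konferenci", None), Token("nevěděla", 2), Token(",", None),
    Token("byla", 3), Token("bych", 3), Token("se", 3), Token("jí", None),
    Token("zúčastnila", 3),
]


def summarize_groups(tokens):
    """Return, per group id, its begin/end index, components and gap words."""
    first, last = {}, {}
    for i, tok in enumerate(tokens):
        if tok.group is not None:
            first.setdefault(tok.group, i)  # index of the first component
            last[tok.group] = i             # index of the last component
    summary = {}
    for gid, begin in first.items():
        end = last[gid]
        inside = tokens[begin:end + 1]
        summary[gid] = {
            "begin": begin,
            "end": end,
            "components": [t.form for t in inside if t.group == gid],
            # Words inside the span that do not belong to the group.
            "gap_words": [t.form for t in inside if t.group != gid],
        }
    return summary


if __name__ == "__main__":
    for gid, info in summarize_groups(SENTENCE).items():
        print(gid, info)
```

For the second group the sketch reports the components jsem and nevěděla with the gap words o, té, konferenci, matching the description above.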
In the above sentence the word jsem is tagged as a verb in the present tense, but the whole verb group to which it belongs, jsem nevěděla, is in the past tense. A similar situation appears in byla bych se jí zúčastnila (I would have participated in it), where zúčastnila is tagged as past tense although it is only a part of the past conditional. Without finding all parts of a compound verb group and without tagging the whole group (which necessarily depends on the other parts of the compound verb group), it is impossible to continue with any kind of semantic analysis. We consider a compound verb group to be a list of verbs, possibly together with the reflexive pronouns se, si. Such a group is composed of auxiliary and full-meaning verbs, e.g. budu se umývat, where budu is an auxiliary verb (like will in English), se is the reflexive pronoun and umývat means to wash. As word-by-word tagging of verb groups is confusing, it is useful to find the whole group and assign a new tag to it. This tag should contain information about the beginning and the end of the group and about the particular components of the verb group. It must also contain information about the relevant grammatical categories that characterise the verb group as a whole. In (Žáčková and Pala, 1999), a proposal of the method for automatic finding of
Similar resources
Corpus-Based Rules for Czech Verb Discontinuous Constituents
In this paper we present a method for extracting general structures of verb groups from a tagged and fully disambiguated corpus and the subsequent use of these structures for building a formal grammar in the Prolog DCG fashion. Our goal is to apply them as rules for the analysis of Czech verb groups in nondisambiguated, grammatically tagged Czech corpus texts. The problem...
A Part-of-Speech Tagging System for the Persian Language
Abstract: Part-of-speech (POS) tagging is essential for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...
Verb Detection in Persian Corpus
A novel technique is introduced for verb and inflection detection in Persian texts. This recognition can be useful in the preprocessing phase of natural language processing (NLP) and text mining tasks such as part-of-speech (POS) tagging and sentence boundary detection (SBD) in Persian texts. Our technique employs structural information about the Persian verb for the first phase of this detection and then uses the...
New 16-membered macrocyclic Schiff base: Synthesis, structural and FT-IR studies
In this paper, the structure of a new 16-membered macrocyclic Schiff base compound N,N′-(3,3′-dimethoxy-2,2′-(propane-1,3-diyldioxy)dibenzylidene)propane-1,3-diamine, C22H26N2O4 (1), derived from 1,3-propanediamine and 3,3′-dimethoxy-2,2′-(propane-1,3-diyldioxy)dibenzaldehyde has been studied by single crystal X-ray diffraction, DFT calculations at B3LYP/6-31G** and FT-IR spectroscopy. The titl...
Rule Based Approach for Arabic Part of Speech Tagging and Name Entity Recognition
The aim of this study is to build a tool for Part of Speech (POS) tagging and Name Entity Recognition for the Arabic language. The approach used to build this tool is a rule-based technique. The POS Tagger contains two phases: the first phase passes the word into a lexicon phase, the second is the morphological phase, and the tagset is (Noun, Verb and Determine). The Named-Entity detector will...